Abstract:

An information processing model of some important aspects of inductive
reasoning is presented within the context of one scientific discipline.
Given a collection of experimental (mass spectrometry) data from
several chemical molecules, the computer program described here
separates the molecules into "well-behaved" subclasses and selects
from the space of all explanatory processes the "characteristic"
processes for each subclass. The definitions of "well-behaved" and
"characteristic" embody several heuristics, which are discussed. Some
results of the program are discussed which have been useful to chemists
and which lend credibility to this approach.

INTRODUCTION

Induction in science has been understood to encompass many different
levels of tasks, from theory construction as performed by Einstein to
everyday non-deductive inferences as made by scientists looking for
explanations of routine data. For the most part, it is not well
defined however one understands it (a notable exception being
statistical inference). Although general statements can be made about
non-deductive inference, it is unlikely that there exists one general
"inductive method" that guides scientific inference at all levels.

Nor does it seem likely that a method of scientific inference at any
one level can succeed without recourse to task-specific information,
that is, information specific to the particular science. Within
these assumptions we are exploring an information processing model of
scientific inference in one discipline.

A unifying theme in our explorations is that induction is efficient
selection from the domain of all possible answers. Previous papers
on the Heuristic DENDRAL Program (1) have advanced this theme with
respect to hypothesis formation in routine scientific work. Recently,
we have been exploring this theme with respect to the higher-order
task of finding general rules to explain large collections of data (2).
This paper extends the previous work to the task of finding rules for
subclasses of objects, given empirical data for the objects but
without prior knowledge of the number of subclasses or the features
that characterize them.

THE TASK AREA

For reasons discussed previously (2), the task area is mass spectrometry,
a branch of organic chemistry. The rule formation task is to find
rules that characterize the behavior of classes of molecules in the
mass spectrometer, given the mass spectrometric data from several
known molecules.

The chemical structure of each molecule is known. The data for each
molecule are (a) the masses of various molecular fragments produced from
the electron bombardment of the molecule in the instrument and (b) the
relative abundances of fragments at each mass. The data for each
molecule are arranged in a fragment-mass table (FMT), or mass spectrum.
Typically, there are 50-100 data points in one FMT. The task is to
characterize the experimental behavior of the whole class of molecules.

Rules which characterize the behavior of the molecules are represented
as conditional sentences in our system. The antecedent of a simple
conditional rule is a predicate which is true or false of a molecule
(or class of molecules); the consequent is a description of a mass
spectrometric action (henceforth "process") which is thought to occur
when that molecule is in the experimental context. We have termed
these rules "situation-action rules" (or "S-A rules"). The rule
syntax has been described previously (3) and is not critical to an
understanding of the present paper.

An example of a rule, rewritten in English, is: "IF the graph of the
molecule contains the estrogen skeleton, THEN break the bonds between
nodes labeled 1*-17 and 14-15." This process (the consequent of this
rule) is named BRK1^L in Table I. The graph of the estrogen skeleton
mentioned in the antecedent is shown with the conventional node
numbering in Figure 3.

These rules will be used in the Heuristic DENDRAL performance program
to determine the structure of compounds, reasoning from the mass
spectrometric data of each. They are also of use to chemists
interested in extending the theory of mass spectrometry.

OVERVIEW OF METHOD

The rule formation program contains three major sub-programs, which
are described below under the headings Data Interpretation, Process
Selection, and Molecule Selection. The control structure for the
overall program is described after the discussions of the three
major sub-programs. A brief overview of the whole program will be
given first, however, in order to set the context.

The purpose of the program is to find the characteristic processes
which determine separable subclasses of molecules, given the experimental
data and molecular structure of each molecule. The overall flow of
the program, as described below, is shown in Figure 1. The three
major steps are to reinterpret the experimental data as molecular
processes, find the characteristic processes for the given molecules,
and select the set of molecules that are "well-behaved" with regard
to the characteristic processes. The reinterpretation of the data is
done once for each molecule in the whole set, and the results are
summarized once. The second and third sub-programs are called
successively until they isolate a well-behaved subclass of molecules
and determine the processes which characterize their behavior. The
monitor then subtracts the well-behaved subclass from the starting
class of molecules, and repeats the successive calls to the second and
third sub-programs. The whole program stops when there are N or fewer
molecules not yet in some well-behaved subclass. (For now, N=3.)

The data interpretation program has been described previously with
some aspects of the process selection program (3). The molecule
selection program and class refinement loop in the control sequence
are new additions.

DATA INTERPRETATION

As mentioned above, the purpose of the data interpretation and summary
program (INTSUM) is to reinterpret the experimentally determined data,
the FMT, for each molecule and summarize the results. Because the
program has been described previously (3), details will be omitted
here. It should be noted that the successful application of this
program to a sub-class of estrogens has already been reported in the
chemical literature (4). The INTSUM program is general in that it
will work on FMT's for any class of molecules with a common skeletal
graph, and it is flexible in that the knowledge used by the program is
easily changed and there are numerous options controlling the operation
of the program.

The INTSUM program is called with the initial set of molecules and
their FMT's. It is also given the graph structure of the skeleton
common to all molecules in the initial set. The first step is to
search the space of all possible processes which could explain data
points in the FMT of any molecule with the given skeleton. The space
of explanatory processes is combinatorial; simple processes that cut
the graph into two fragments are generated first, followed by pairs
of simple processes, triples, and so on. The heuristics listed below
constrain the search:

Simplicity (Occam's Razor)

If two or more processes explain the same data point, prefer the
simpler one, i.e., the process involving fewer simple steps.

Chemical Constraints

(a) Break no more than NB bonds in any process, whether simple or
multi-step (NB=5 in our current version); (b) Do not allow any process
to break two bonds to the same carbon atom; (c) Do not allow a fragment
to contain fewer than NA atoms (NA=5 currently); (d) Do not allow any
process to contain more than NP simple processes (NP=2 currently); (e)
Break only single bonds (no double or triple bonds).
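
The chemical constraints (a) through (e) can be read as a single
filter over candidate processes. The following is only a minimal
sketch under stated assumptions: the Bond and Process classes, the
is_plausible function, and the toy node numbers are hypothetical and
are not taken from INTSUM; only the parameter values NB, NA and NP
come from the text.

```python
from dataclasses import dataclass

NB = 5  # most bonds a process may break (value quoted in the text)
NA = 5  # fewest atoms allowed in a fragment (value quoted in the text)
NP = 2  # most simple processes in one process (value quoted in the text)


@dataclass(frozen=True)
class Bond:
    a: int          # node number at one end (hypothetical representation)
    b: int          # node number at the other end
    order: int = 1  # 1 = single, 2 = double, 3 = triple


@dataclass(frozen=True)
class Process:
    steps: tuple           # each simple step is a tuple of Bond objects
    fragment_sizes: tuple  # atom count of every fragment produced


def is_plausible(process, carbons):
    """Return True if the process passes constraints (a) through (e)."""
    bonds = [bond for step in process.steps for bond in step]
    if len(bonds) > NB:                                     # (a)
        return False
    ends = [n for bond in bonds for n in (bond.a, bond.b) if n in carbons]
    if len(ends) != len(set(ends)):                         # (b)
        return False
    if any(size < NA for size in process.fragment_sizes):   # (c)
        return False
    if len(process.steps) > NP:                             # (d)
        return False
    return all(bond.order == 1 for bond in bonds)           # (e)


# Toy skeleton (an assumption): nodes 1..18 are all carbon atoms.
carbons = frozenset(range(1, 19))
two_cuts = Process(steps=((Bond(1, 2), Bond(3, 4)),), fragment_sizes=(13, 5))
same_atom = Process(steps=((Bond(1, 2), Bond(1, 10)),), fragment_sizes=(13, 5))
print(is_plausible(two_cuts, carbons), is_plausible(same_atom, carbons))
```

Because the filter never looks at an FMT, it matches the a priori
character of this stage of the search.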

The heuristic search produces a list of plausible processes without
reference to the data. The second step of the INTSUM program is to
determine for each process and each FMT whether there is evidence for
the process in the FMT. If so, then that process can explain the data
point and the strength of the evidence is saved. The final step is
to summarize for each process and all molecules the frequency, total
strength of evidence and number of alternative explanations. (Frequency
for a given process is the percentage of all molecules that have evidence
for the process.) These statistics are passed to the process
selection program.
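
As an illustration of the summary step, the sketch below computes the
three statistics for each process from a nested dictionary of evidence.
The data layout, the function name summarize, and the way alternative
explanations are counted (other processes explaining the same data
point) are assumptions made for this example, not the INTSUM
implementation.

```python
from collections import defaultdict


def summarize(evidence, molecules):
    """Summarize evidence per process.

    evidence:  {process: {molecule: {mass: abundance}}} (assumed layout)
    molecules: list of all molecule names in the set
    """
    # Record which processes explain each (molecule, mass) data point.
    explainers = defaultdict(set)
    for proc, by_mol in evidence.items():
        for mol, peaks in by_mol.items():
            for mass in peaks:
                explainers[(mol, mass)].add(proc)

    summary = {}
    for proc, by_mol in evidence.items():
        hits = [m for m in molecules if by_mol.get(m)]
        summary[proc] = {
            # fraction of all molecules with evidence for the process
            "frequency": len(hits) / len(molecules),
            # summed abundance of the data points the process explains
            "strength": sum(sum(by_mol[m].values()) for m in hits),
            # competing explanations for those same data points
            "alternatives": sum(len(explainers[(m, mass)]) - 1
                                for m in hits for mass in by_mol[m]),
        }
    return summary


# Toy data with hypothetical process and molecule names.
toy = {
    "A": {"m1": {100: 10.0}, "m2": {100: 5.0}},
    "B": {"m1": {100: 10.0}},
}
stats = summarize(toy, ["m1", "m2"])
print(stats["A"], stats["B"])
```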

PROCESS SELECTION

The process selection program chooses the most characteristic processes
for the given class of molecules from the list of a priori plausible
processes that are output by the INTSUM program. It assumes that the
molecules given to it are all in one well-behaved class. Thus, it can
merely filter the list of processes to find those which satisfy the
criteria for characteristic processes.

A process mentioned in a rule statement must satisfy several criteria
in order to be counted as a characteristic process for the molecules
under consideration. The INTSUM program provides a summary of
statistics for the plausible processes it has chosen from the space of
all processes. The process selection program applies heuristic
criteria to sort out the most likely processes and to distinguish
among alternative explanations, when alternatives remain. It uses
the information from the data for filtering, in contrast to the a
priori filtering in the INTSUM program. For example, an a priori
simplicity criterion filters out processes that break too many bonds.
The criteria for "most likely processes" (frequency, strength of
evidence, and degree of uniqueness) are discussed below. To a large
extent the choice of these criteria and particularly the choice of
parameter settings are arbitrary. However, the following discussion
provides some rationale for our choices.

Frequency

If nature presented clear and unambiguous data to us we could expect
all and only characteristic processes for a class of molecules to
occur 100% of the time. This is what we would like to mean by
'characteristic' process. However, the data contain noise and, more
importantly, we are forced to interpret the data in terms of processes
that we construct. Thus, in the literature one finds discussions of
exceptions to rules together with presentation of the rules. A low
frequency threshold (60%) is used as a criterion for plausible process
instead of a high one because the marginal processes which are included
at one step can be excluded at a later refinement step if they prove
to be uncharacteristic of a class of molecules.




Strength of Evidence

The program considers the strength of evidence found for each process,
besides the frequency of molecules that show the process. Associated
with each fragment mass in the experimental data is a measure of the
percent of total ions (or ion current) contributed by fragments of
that mass. (The evidence from mass spectrometry is not merely binary,
i.e., yes/no, although we have considered it that way in the past.)

The total ion current for any molecule can be visualized as the sum of
all y-values in a bar graph in which the x-values represent fragment
masses. The strength of evidence for a process, then, is the percent
of the total of all ion currents (for all molecules) that can be
explained by the process. The present value of this parameter is
0.005, i.e., 0.5% of the data must be explained by any process that
will be said to be characteristic of the given molecules.
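
The strength-of-evidence measure can be written down directly from
this definition. The sketch below is illustrative only: the dictionary
layout of the spectra, the function name strength_of_evidence, and the
toy abundances are assumptions; the 0.005 threshold is the value quoted
above.

```python
STRENGTH_THRESHOLD = 0.005  # 0.5% of all ion current, as quoted in the text


def strength_of_evidence(explained, spectra):
    """Fraction of the total ion current of all molecules explained.

    spectra:   {molecule: {mass: abundance}}  (assumed layout of the FMTs)
    explained: {molecule: [masses the process accounts for]}
    """
    total = sum(sum(fmt.values()) for fmt in spectra.values())
    covered = sum(spectra[mol][mass]
                  for mol, masses in explained.items()
                  for mass in masses)
    return covered / total


# Toy spectra with hypothetical masses and abundances.
spectra = {
    "m1": {100: 30.0, 120: 70.0},
    "m2": {100: 50.0, 150: 50.0},
}
strength = strength_of_evidence({"m1": [100], "m2": [100]}, spectra)
passes = strength >= STRENGTH_THRESHOLD
print(strength, passes)
```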

There may be much information in the weaker data points, but until we
can interpret the strong signals, we do not want to start looking
critically at the weak ones. This is why we have a strength of
evidence threshold (although in our trials we have kept it fairly low).

Degree of Uniqueness

The program will discard processes that cannot uniquely explain at
least n data points for each molecule. The rationale behind this
criterion is that processes that are always (or often) redundant with
other processes have no explanatory power of their own. In spite of
the intuitive appeal of this criterion, it was not used for the trials
reported here in which molecule selection is coupled with process
selection. For process selection alone, it is a useful filter.

These three criteria filter the processes to provide the characteristic
processes for the molecules given to the program. However, the
processes may still overlap in the data points that they explain. If
two (or more) processes are ambiguous, i.e., they explain most of the
same data points, the program tries to resolve the ambiguity in favor
of a single explanation. This is not easy, for the competing
explanations have all passed the tests for "most likely processes"
just discussed. Thus, they all appear good enough to be rules on their
own.

The resolution of ambiguities among processes is made according to
relative values of the criteria used to judge them likely in the
first place. That is, the values of frequency, strength of evidence
and degree of uniqueness are compared - in any order - to determine
which process is preferred, if any.
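
One way to make an order-independent comparison concrete is a
dominance test: a process is preferred only if it is at least as good
on all three criteria and the two are not tied. The sketch below uses
that reading. It is an assumption, not the program's documented rule,
as are the Candidate record, the 50% overlap cut-off standing in for
"most of the same data points", and the toy values.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    frequency: float   # fraction of molecules showing the process
    strength: float    # fraction of total ion current explained
    uniqueness: int    # data points only this process explains
    points: frozenset  # identifiers of the data points explained


def ambiguous(a, b, overlap=0.5):
    """True if a and b share more than `overlap` of the smaller point set."""
    shared = len(a.points & b.points)
    return shared > overlap * min(len(a.points), len(b.points))


def prefer(a, b):
    """Return the dominating candidate, or None if neither dominates."""
    score_a = (a.frequency, a.strength, a.uniqueness)
    score_b = (b.frequency, b.strength, b.uniqueness)
    if score_a == score_b:
        return None
    if all(x >= y for x, y in zip(score_a, score_b)):
        return a
    if all(y >= x for x, y in zip(score_a, score_b)):
        return b
    return None  # the criteria disagree, so the ambiguity remains


# Toy candidates with hypothetical names and statistics.
p = Candidate("P", 0.90, 0.10, 2, frozenset({1, 2, 3}))
q = Candidate("Q", 0.80, 0.10, 2, frozenset({2, 3, 4}))
r = Candidate("R", 0.95, 0.05, 2, frozenset({1, 2, 3}))
print(ambiguous(p, q), prefer(p, q), prefer(p, r))
```

Returning None when the criteria disagree corresponds to the "if any"
in the text: some ambiguities are simply left unresolved.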

MOLECULE SELECTION

Molecule selection, by itself, is a simple program whose purpose is to
find a subclass of molecules that are "well-behaved" with respect to
a set of processes. Its inputs are (a) a class of molecules and (b)
a set of processes that are characteristic of those molecules (output
of the process selection program just described).

The processes that are chosen as roughly characteristic of a class of
molecules are used by the molecule selection program to refine the
extension of the class. Several processes will each have a few
exceptions, the number permitted depending on the frequency threshold
used by the program. But if the same molecules appear as exceptions
over and over again (for several processes) then they probably do not
belong in the same subclass with the molecules whose behavior is
characterized by those processes.

A molecule is said to be well-behaved with respect to a set of
processes (or well-behaved) if it shows evidence for at least MP of
the processes. The current value of MP is 85% of the number of
processes in the set. Currently this is the only criterion used to
identify members of the subclass, although other features of the
molecules could also be used for clustering. For example, the
structural features of chemical molecules could also help classify
molecules which "belong" together. The reason descriptive features
such as these are not used during molecule selection is that they
constitute a good check (by chemists) on the adequacy of the results
of the molecule separation procedure.
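
The 85% criterion amounts to a one-pass filter over the molecules. The
sketch below is a minimal illustration; the function name
select_molecules, the set-of-pairs encoding of the evidence, and the
toy names are assumptions. Integer percentages are used so that the
comparison avoids floating-point rounding.

```python
def select_molecules(molecules, processes, evidence, min_percent=85):
    """Return the molecules showing evidence for at least min_percent
    of the given processes.

    evidence: set of (molecule, process) pairs for which evidence was
              found (an assumed encoding of the INTSUM results)
    """
    selected = []
    for mol in molecules:
        shown = sum((mol, proc) in evidence for proc in processes)
        if 100 * shown >= min_percent * len(processes):
            selected.append(mol)
    return selected


# Toy example: seven hypothetical processes, so 85% means six of seven.
processes = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
evidence = ({("x", p) for p in processes[:6]} |
            {("y", p) for p in processes[:5]})
well_behaved = select_molecules(["x", "y"], processes, evidence)
print(well_behaved)
```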

CONTROL STRUCTURE OF THE RULE FORMATION MONITOR

The overall flow of control has been briefly described and diagrammed
in Figure 1, and the three major components of the whole program have
been discussed. The interaction between process selection and molecule
selection is the last important detail in the description of the
program. It is shown schematically in Figure 2 and selected portions
of intermediate output are shown in Table II.

After the INTSUM program interprets and summarizes the data for a set
of molecules, the process selection program is asked to find a set of
processes that characterize those molecules. However, process
selection starts with the assumption that the molecules should be
characterized all together, i.e., that the molecules are homogeneous,
or belong in one class with respect to mass spectrometry. The purpose
of the rule formation monitor, and the molecule selection program in
particular, is to remove the necessity of working within this
assumption. Because a class of molecules has a common skeleton, there
is reason to believe that they are homogeneous (with respect to mass
spectrometry processes). But this is not necessarily true. Many of
the molecules whose structures contain the graph common to estrogens
(e.g., the equilenins discussed with Table II in the Results section)
fail to exhibit behavior that is characteristic of most estrogens in
the mass spectrometer.

The monitor begins with the Null Hypothesis that the initial set M
of molecules is homogeneous with respect to all the relevant processes
given as input. With the process selection program it finds plausible
processes that roughly characterize the whole class of molecules. It
attempts to confirm the hypothesis by finding the subclass S of
molecules that are well-behaved for those processes. If this subclass
S is the same as the initial set M, then the assumption of homogeneity
is taken to be true. In that case, there is no proper subset to be
separated.

When the subclass S is different from the starting class M, however,
the program loops back to process selection as shown in Figure 2.
This figure shows the procedure for producing one homogeneous subclass
of molecules (and the characteristic processes for the subclass); this
procedure, rule formation, is itself used repeatedly in the main
program as shown in Figure 1.

The inputs to the rule formation procedure are (a) the set RP of
relevant processes and statistics for them, viz., the output of INTSUM,
and (b) a class M' of molecules, where M' is initially the same as the
entire class of molecules, M, given to INTSUM. M' is used to keep
track of the best refinement of M so far.

The process selection program selects a set of processes P from RP in
the manner described above. P characterizes the class M', insofar
as M' can be characterized at all. The criteria for characteristic
process can be made more restrictive if the class is known to be
homogeneous (e.g., frequency > 85%). In this case, however, the
loose criteria listed above are used (e.g., frequency > 60%) in
order to allow many exceptions to the "characteristic" processes.

The molecule selection program selects a subclass of molecules S,
from M', that are best characterized by the processes in P. The
subclass S includes molecules that show evidence for most (85% or more)
of the processes in P, and excludes molecules that are exceptions to
many. Thus S is at least as well behaved as M' with respect to P.
And since the two measures of selection are not perfectly complementary,
S is likely to be better behaved than M' with respect to P. (If
molecule selection uses less restrictive measures than process
selection, then S will be less well behaved than M' and the procedure
will fail except when the initial set of molecules is homogeneous.)

One interesting part of the procedure is that after processes are
selected, ALL of the molecules are reclassified with regard to the
number of times they appear as exceptions to the processes. This is
shown in Figure 2: at step 2 of each level all molecules in the
initial set, M (not M' or S), are tested against the processes. Thus,
a molecule can be excluded at one level (because it is an exception
to too many of the processes at that level), but be included again at
another level for a slightly different set of processes.

The condition under which we want the program to stop is that the
subclass S of molecules after an iteration is the same as the class
M' from which the iteration started (condition 1 in Figure 2). In
other words, under this condition the program has found an S and a
P such that P characterizes S (S = M') and S is well-behaved with
respect to P. The subclass S is taken to be homogeneous, and the
processes in P can be taken to be mass spectrometry rules for
molecules in S.

The refinement level in Figure 2 is the number of times the procedure
has been invoked in trying to find one homogeneous subclass of
molecules. The second of the stopping conditions tests whether the
refinement level is equal to an arbitrary maximum, which is currently
3. This condition is necessary to avoid an infinite loop in the case
where the program can find no subclass S that is homogeneous with
respect to P. The level 3 has been observed to produce fairly
acceptable results: after three iterations through this loop, the
subclass S is about as refined as it will get. After more iterations
the procedure appears to oscillate in that molecules added to S in
one iteration are subtracted from S in a later iteration. Our
experience is very limited. Because there is no guarantee that the
procedure converges, however, some stopping condition like the
maximum refinement level is necessary.

The last stopping condition shown in Figure 2 tests whether there are
enough molecules in the subclass to warrant further refinement. If
there are fewer than an arbitrary minimum number (=3) of molecules in
S, then further refinements will be unreliable. This minimum is not
completely arbitrary, since it depends to some extent on the frequency
measures used in process and molecule selection. But, intuitively,
when the number of molecules in S is small there is little value in
breaking S up into subclasses anyway.
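
The refinement loop and its three stopping conditions can be
summarized in a few lines. The sketch below is an illustration under
stated assumptions: form_rules, the two toy selection functions, and
the EVIDENCE table are hypothetical stand-ins, with the 60% and 85%
thresholds and the limits of 3 taken from the text.

```python
def form_rules(M, RP, select_processes, select_molecules,
               max_level=3, min_size=3):
    """Find one well-behaved subclass S of M and its processes P."""
    M_prime = set(M)  # best refinement of M so far
    level = 0
    while True:
        P = select_processes(M_prime, RP)
        S = set(select_molecules(M, P))  # every molecule in M is retested
        level += 1
        # Stopping conditions 1, 2 and 3 described in the text.
        if S == M_prime or level == max_level or len(S) < min_size:
            return S, P
        M_prime = S  # subclass refinement


# Toy data (hypothetical): processes each molecule shows evidence for.
EVIDENCE = {
    "a": {"p1", "p2"}, "b": {"p1", "p2"}, "c": {"p1", "p2"},
    "d": {"p1", "p2"}, "e": {"p3"},
}


def toy_select_processes(molecules, RP, min_percent=60):
    # Keep processes seen in at least min_percent of the molecules.
    return [p for p in RP
            if 100 * sum(p in EVIDENCE[m] for m in molecules)
            >= min_percent * len(molecules)]


def toy_select_molecules(molecules, P, min_percent=85):
    # Keep molecules showing at least min_percent of the processes.
    return [m for m in molecules
            if 100 * sum(p in EVIDENCE[m] for p in P)
            >= min_percent * len(P)]


S, P = form_rules(EVIDENCE, ["p1", "p2", "p3"],
                  toy_select_processes, toy_select_molecules)
print(sorted(S), P)
```

In this toy run the loop stops at the second level because S equals
M', which is stopping condition 1.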


As shown in the overall flow diagram, Figure 1, after the first major
subclass (S) has been defined, all molecules in S are removed from
any further consideration by subtracting them from M. The entire
procedure is then repeated with the new M. It stops only when there
are so few molecules left in M (3 or fewer) that process selection is
unreliable and molecule selection appears pointless.
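
The outer loop of the monitor can be sketched as follows. This is a
hedged illustration, not the program itself: the names monitor and
stub_form_rules are hypothetical, the rule formation procedure is
passed in as a function, and the guard against an empty subclass is an
added assumption that the text does not discuss.

```python
def monitor(molecules, RP, form_rules, n_stop=3):
    """Repeat rule formation, subtracting each subclass from M.

    Returns the collected (S, P) pairs and the unclassified molecules.
    """
    M = set(molecules)
    pairs = []
    while len(M) > n_stop:  # stop when n_stop or fewer molecules remain
        S, P = form_rules(M, RP)
        if not S:
            break  # assumption: stop if nothing could be separated
        pairs.append((S, P))
        M -= S  # subtraction step
    return pairs, M


def stub_form_rules(M, RP):
    # Hypothetical stand-in: take the three lowest-numbered molecules.
    return set(sorted(M)[:3]), list(RP)


pairs, unclassified = monitor(range(1, 9), ["p1"], stub_form_rules)
print(pairs, unclassified)
```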

The output of the whole program now is merely the collected set of
outputs from all iterations, viz., the collected S,P pairs, as shown
in Figure 2. Future work will focus on automatically generalizing
the descriptions of the molecules. This is now done by hand, except
when the initial class M is homogeneous - then the generalized
description is the common graph structure.


RESULTS

The INTSUM program alone has already provided useful new results for
chemists, as reported in the chemical literature (4). The process
selection program, working with output from INTSUM (but without
molecule selection), has successfully found sets of characteristic
processes for a well-understood class of molecules (estrogens,
Figure 3) and for classes whose behavior is still under investigation
(e.g., equilenins, progesterones, amino acids). For 47 estrogens,
which were assumed by both an expert and the program to be in one
class, rules found by the program agree closely with rules formed by
the expert from the same data. (This result is not shown in a table,
but the comparison with the expert's rules looks much like that shown
in Table I.) Expert chemists have made suggestions for improvements,
but were generally in agreement with the processes selected by the
program.

The rule formation program with molecule selection has been tested on
several sets of molecules. The results of running the program on a
set of 15 estrogens (a subset of the 47 mentioned above) are shown
in Table I. The program separated two of the 15 compounds into a
second class because they were not as well behaved as the rest - they
were exceptions to about 20% of the characteristic processes. However,
the chemist thought the separation was reasonable. The processes
selected by the program are shown with indications of the discrepancies
between the program's choices and the chemist's. The discrepancies
mostly arose from the program's applying different criteria to select
one process from viable alternatives. Table II shows the success of
the molecule separation part of the program when rule formation was
done on data from 19 non-homogeneous estrogenic steroids. The major
subclass of chemical interest is the set of 5 equilenins, which are
identified by common modifications to the skeleton shown in Figure 3.
The structural properties were not used by the program although the
chemist did classify the compounds by such features. By selecting
well-behaved subclasses of molecules the program grouped four of the
five "equilenins" (molecules #4, 8, 10, 19) and three "3-acetates"
(#3, 11, 18) in the first subclass. The fifth equilenin (#2) was
removed from that subclass on the last refinement because it was an
exception to 1 of 1 characteristic processes used to determine the
subclass.

In the third iteration shown in Table II, the program grouped three
of the chemist's four "3-benzoates" together (molecules #12, 13, 14).
In the fourth iteration it grouped together the chemist's two
"diacetates" and one "triacetate" (molecules #9, 15, 16). Two iterations
produced subclasses with only two members - when put together they
encompass two "17-acetates" (#1, 17), one "17-benzoate", and one
"gamma-lactone" (#5). The two molecules remaining unclassified at
the end of the procedure were the last "equilenin" (molecule #2) and
the last "3-benzoate" (#6).

CONCLUSIONS

Building an information processing model of scientific reasoning in
mass spectrometry, although not completed, has already led to
interesting and useful results. The model incorporates heuristic
search in process selection. The procedure for selecting molecules
can be thought of as a planning procedure insofar as it reduces the
problem of formulating rules for a class of diverse molecules to a
number of smaller subproblems, viz., formulating rules for smaller
classes of well-behaved molecules. Moreover, the molecule selection
procedure is highly dependent on process selection, as described in
detail.

The incompleteness of the program as a model of the entire rule
formation procedure should be readily apparent. We have not
described anything that approximates confrontation of rules with new
data, for example. But as the results section indicates, the program
can separate subclasses of well-behaved molecules and can find
characteristic processes for the subclasses with enough accuracy (on
a few examples) to gain preliminary acceptance by an expert in the
field.

Figure 1. OVERALL FLOW OF RULE FORMATION PROGRAM

          INPUT: List of Molecule - Data Pairs
                      |
          PROGRAM: INTSUM - Data Interpretation and Summary
                      |
   +----> List of Molecules, M.
   |      List of Relevant Processes, RP, with
   |      Summary Statistics for Each Process
   |                  |
   |      PROGRAM: Rule Formation*
   |                  |
   |      Set of Characteristic Processes, P (P ⊆ RP),
   |      Class of Well-Behaved Molecules, S (S ⊆ M)
   |                  |
   |      SUBTRACTION STEP: Remove all Molecules in S from M.
   |                  |
   +- No  STOPPING CONDITION: M contains 3 or Fewer Molecules.
                      |
                      | Yes
                      |
                    STOP
          OUTPUT = All S-P pairs found.

   * Details in Figure 2.

Figure 2. DETAILS OF INTERACTION BETWEEN PROCESS SELECTION AND
MOLECULE SELECTION IN THE RULE FORMATION PROGRAM

   INITIALIZE:  Refinement Level = 0
                M  = Original class of molecules.
                M' = M.
                RP = Relevant processes (from INTSUM) including
                     evidence and statistics for the processes.

          INPUT: M', RP
                      |
   +----> SUB-PROGRAM: Process Selection (using the null hypothesis
   |      that all molecules can be characterized by the same set
   |      of processes)
   |                  |
   |      Set of processes, P, that are characteristic of M' (P ⊆ RP)
   |                  |
   |      SUB-PROGRAM: Molecule Selection
   |                  |
   |      Subclass of Molecules, S, selected from M such that
   |      every molecule in S is well-behaved with respect to
   |      the processes in P
   |                  |
   |      Increment Refinement Level
   |                  |
   |      Test for Stopping Conditions:
   |         1. S = M', or
   |         2. Refinement level = 3, or
   |         3. Fewer than 3 molecules in S
   |                  |
   |                  +---- YES ----> STOP. OUTPUT
   |                  |
   |                  | NO
   |                  |
   +----- SUBCLASS REFINEMENT: Reset M' to S (M' = S).

TABLE I.

PROCESSES SELECTED FOR 15 ESTROGENS
BELIEVED TO BE IN ONE WELL-BEHAVED CLASS

(The PICTORIAL DESCRIPTION column of the original table is not
reproduced.)

   PROCESS LABEL*                                % OF ALL DATA
                                                 POINTS EXPLAINED

    1. BRKO                                           22%
    2. BRK2L/19L                                      14%
       (preferred over BRKTL and BRK2L/18L)
    3. BRK6L or BRK2L/17L                             11%
    4. BRK10L                                          8%
    5. b.OQLL or BPJaBL                                6%
    6. BRKl~L                                          5%
    7. BRK2L/10L                                       4%
       (preferred over BRKlBL)
    8. BRELL                                           3%
    9. BRK5L or BRK13L                                 2%
   10. BR5QOL/15H or EHIMI/20L or EREl*:i/19L          2%
   11. BRJQ1L                                          2%
   12. BRK2L/11L                                       2%
       (preferred over BRK20L)
   13. BRK5H/10L                                       2%
   14. BRK5H/12L                                       1%
   15. BRK12L/15H or BRK12L/1LH                        1%

   TOTAL PERCENT OF DATA EXPLAINED                    85%

* The underlined processes are those selected by an expert chemist on
the basis of data from 47 well-behaved estrogens, including these 15.

TABLE II

SUMMARY OF STEPS IN THE RULE FORMATION
PROCEDURE WITH 19 ESTROGENIC STEROIDS

                       Molecules                        Processes

ITERATION #1
Initial Set:           (1,2,3,...,19)              -->  BRKO
                                                        BRK10L
                                                        BRK11L
                                                        BRK20L
                                                        BRK2L/10L
                                                        BRKSUB3L/3L
                                                        BRKSUB3L/12L

First Refinement:      (2,3,4,5,8,10,11,19)        -->  BRKO
                                                        BRK10L
                                                        BRK11L
                                                        BRK20L
                                                        BRKOOC3*1L
                                                        BRKSUB3L/2L
                                                        BRKSUB3L/23L
                                                        BRKSUB18L/11L

Second Refinement:                                      BRKO
                                                        BRK10L
                                                        BRK11L
                                                        BRK20L
                                                        BRKOC3*1L/11L
                                                        BRKOOC3*1L
                                                        BRKSUB3L/2L
                                                        BRKSUB18L/11L
                                                        BRKSUB3L/23L

Third Refinement:      (3,4,8,10,11,18,19)              same
 = Subclass 1

ITERATION #2
Initial Set            (1,2,5,6,7,9,12,13,              BRKO
(- Subclass 1)          14,15,16,17)                    BRK16L
                                                        BRK2L/19L
                                                        BRKSUB3L/3L

Third Refinement       (5,17)                      -->  BRKO
 = Subclass 2                                           BRK2L/19L
                                                        BRKOC3*1L/8L
                                                        BRKOC3*1L/17L
                                                        BRKOOC17*1L

ITERATION #3
Third Refinement       (11,12,13,14)               -->  BRKO
 = Subclass 3                                           BRKBT3*1H
                                                        BRKBT3*1L/3L
                                                        BRKSUB3L/3L

ITERATION #4
Last Refinement:       (9,15,16)                   -->  BRKO
 = Subclass 4                                           BRKOOC3*1L
                                                        BRKOOC3*1L/6L
                                                        BRKOOC3*1L/7L
                                                        BRKOOC3*1L/8L
                                                        BRKOOC3*1L/18L
                                                        BRKOOC3*1L/17L
                                                        BRKOOC17*1L

ITERATION #5
Last Refinement:       (1,7)                       -->  BRKO
 = Subclass 5                                           BRK6L
                                                        BRK7L
                                                        BRK8L
                                                        BRK10L
                                                        BRK11L
                                                        BRK14L
                                                        BRK15L
                                                        BRK16L
                                                        BRK17L
                                                        BRK2L/17L
                                                        BRK2L/10L
                                                        BRKOOC17*1L
                                                        BRKSUB17L
                                                        BRKSUB17L/1L

UNCLASSIFIED MOLECULES (2,6)